Urban Greenspace and Asthma Prevalence¶

Get Started with Big Data Pipelines¶

Vegetation has the potential to provide many ecosystem services in urban areas, such as cleaner air and water and flood mitigation. However, results are mixed on the relationship between a simple measurement of vegetation cover (such as average NDVI, a measure of vegetation health) and human health. We do, however, find relationships between human health and landscape metrics that attempt to quantify the connectivity and structure of greenspace. These types of metrics include mean patch size, edge density, and fragmentation.

In this project, I calculate patch, edge, and fragmentation statistics about urban greenspace in Denver. These statistics should reflect the connectivity and spread of urban greenspace, which are important for ecosystem function and access. I then use a linear model to identify statistically significant correlations between the distribution of greenspace and health data compiled by the U.S. Centers for Disease Control and Prevention (CDC).

Read More¶

Check out this study by Tsai et al. 2019 on the relationship between edge density and life expectancy in Baltimore, MD. The authors also discuss the influence of scale (e.g. the resolution of the imagery) on these relationships, which is important for this case study.

Full citation: Tsai, Wei-Lun, Yu-Fai Leung, Melissa R. McHale, Myron F. Floyd, and Brian J. Reich. 2019. “Relationships Between Urban Green Land Cover and Human Health at Different Spatial Resolutions.” Urban Ecosystems 22 (2): 315–24. https://doi.org/10.1007/s11252-018-0813-3.

Working with larger-than-memory (big) data¶

For this project, we are going to split up the green space (NDVI) data by census tract, because this matches the human health data from the CDC. If we were interested in the average NDVI at this spatial scale, we could easily use a source of multispectral data like Landsat (30m) or even MODIS (250m) without a noticeable impact on our results. However, because we need to know more about the structure of green space within each tract, we need higher resolution data. For that, we will access National Agriculture Imagery Program (NAIP) data, which are collected for the continental US every few years at 1m resolution or finer. That's enough to see individual trees and cars! The main purpose of the NAIP data is, as the name suggests, agriculture. However, it's also a great source for urban areas, where a lot is happening at a very small scale.

The NAIP data for the City of Denver takes up about 20GB of space. This amount of data is likely to crash your kernel if you try to load it all in at once. It also would be inconvenient to store on your hard drive so that you can load it in a bit at a time for analysis. Even if you are using a computer that could handle this amount of data, imagine analyzing the entire United States over multiple years!

To help with this problem, you will use cloud-based tools to calculate your statistics instead of downloading rasters to your computer or container. You can crop the data entirely in the cloud, thereby saving on your memory and internet connection, using rioxarray.

STEP 1: Set up your analysis¶

Try It

As always, before you get started:

  1. Import necessary packages
  2. Create reproducible file paths for your project file structure.
  3. To use cloud-optimized GeoTIFFs, we recommend some settings to make sure your code does not get stopped by a momentary connection lapse – see the code cell below.
In [1]:
### Import libraries
import os
import pathlib
import numpy as np

import rasterio
import geopandas as gpd
import pandas as pd

import xarray
import rioxarray as rxr
import rioxarray.merge as rxrmerge
import shapely
import time

## See Warnings
import warnings

## Add visual packages
import geoviews as gv
import holoviews as hv
import hvplot.pandas # Plot vector data
import hvplot.xarray
from cartopy import crs as ccrs

## Data exploration and visualization packages
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats as stats

## Image Processing packages
from scipy.ndimage import convolve
from scipy.ndimage import label

## Using data in the cloud Packages
import pystac_client

## Progress Bar
from tqdm.notebook import tqdm

## OLS sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
In [2]:
!pip install sodapy # To get sodapy
Requirement already satisfied: sodapy in /opt/miniconda3/envs/earth-analytics-python/lib/python3.11/site-packages (2.2.0)
Requirement already satisfied: requests>=2.28.1 in /opt/miniconda3/envs/earth-analytics-python/lib/python3.11/site-packages (from sodapy) (2.32.5)
Requirement already satisfied: charset_normalizer<4,>=2 in /opt/miniconda3/envs/earth-analytics-python/lib/python3.11/site-packages (from requests>=2.28.1->sodapy) (3.4.4)
Requirement already satisfied: idna<4,>=2.5 in /opt/miniconda3/envs/earth-analytics-python/lib/python3.11/site-packages (from requests>=2.28.1->sodapy) (3.11)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/miniconda3/envs/earth-analytics-python/lib/python3.11/site-packages (from requests>=2.28.1->sodapy) (2.5.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/miniconda3/envs/earth-analytics-python/lib/python3.11/site-packages (from requests>=2.28.1->sodapy) (2025.11.12)
In [3]:
from sodapy import Socrata
In [4]:
### Create reproducible file paths

data_dir = os.path.join(pathlib.Path.home(), "Earth Data Analytics", "Spring 26", "greenspaces", "data",)
os.makedirs(data_dir, exist_ok=True) 
In [5]:
### Prevent GDAL from quitting due to momentary disruptions
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"

STEP 2: Create a site map¶

We’ll be using the Centers for Disease Control and Prevention (CDC) Places dataset for human health data to compare with vegetation metrics. CDC Places also provides some modified census tracts, clipped to the city boundary, to go along with the health data. We’ll start by downloading the matching geographic data, and then select the City of Denver.

Try It
  1. Download the Census tract Shapefile that goes along with CDC Places
  2. Use a row filter to select only the census tracts in Denver
  3. Use a conditional statement to cache your download. There is no need to cache the full dataset – stick with your pared-down version containing only Denver.
In [6]:
### Set up the census tract path
tract_dir = os.path.join(data_dir, "Denver_tract")
os.makedirs(tract_dir, exist_ok=True)

tract_path = os.path.join(tract_dir, "Denver_tract.shp")

### Download the census tracts (only once)
if not os.path.exists(tract_path):
    tract_url = ("https://data.cdc.gov/download/x7zy-2xmx/application%2Fzip") # Census URL
    tract_gdf = gpd.read_file(tract_url) # Read the .shp file
    den_gdf = tract_gdf[tract_gdf.PlaceName=="Denver"] # Filter only Denver data
    den_gdf.to_file(tract_path, index=False) # Save this output


### Load in census tract data
den_tract_gdf = gpd.read_file(tract_path)
den_tract_gdf
Out[6]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry
0 0820000 08031000102 08 Denver 0820000-08031000102 3109 POLYGON ((-11691351.798 4834636.885, -11691351...
1 0820000 08031000201 08 Denver 0820000-08031000201 3874 POLYGON ((-11688301.532 4835632.272, -11688302...
2 0820000 08031000202 08 Denver 0820000-08031000202 3916 POLYGON ((-11688362.201 4834372.228, -11688360...
3 0820000 08031000301 08 Denver 0820000-08031000301 5003 POLYGON ((-11691355.36 4833538.467, -11691357....
4 0820000 08031000302 08 Denver 0820000-08031000302 4036 POLYGON ((-11692926.635 4832494.047, -11692925...
... ... ... ... ... ... ... ...
139 0820000 08031015500 08 Denver 0820000-08031015500 3313 MULTIPOLYGON (((-11681197.008 4821951.258, -11...
140 0820000 08031015600 08 Denver 0820000-08031015600 6498 POLYGON ((-11688537.194 4817797.736, -11688985...
141 0820000 08031015700 08 Denver 0820000-08031015700 5535 POLYGON ((-11691342.668 4817773.296, -11691342...
142 0820000 08031980000 08 Denver 0820000-08031980000 1165 POLYGON ((-11646235.781 4840781.884, -11646415...
143 0820000 08031980100 08 Denver 0820000-08031980100 0 POLYGON ((-11671535.922 4836666.59, -11671535....

144 rows × 7 columns

In [7]:
### Site plot -- Census tracts with satellite imagery in the background
(
    den_tract_gdf

    ## Set projection
    .to_crs(ccrs.Mercator())

    ## Plot with satellite imagery in the background
    .hvplot(
        line_color = 'orange', fill_color = None,
        crs = ccrs.Mercator(), tiles = 'EsriImagery',
        frame_width = 600
    )
)
Out[7]:
Reflect and Respond

What do you notice about the City of Denver from the coarse satellite image? Is green space evenly distributed? What can you learn about Denver from websites, scientific papers, or other resources that might help explain what you see in the site map?

In this map, we can clearly see that there is significant greenspace along the eastern border of Denver. Notably, there is much less greenspace in the central area and along the highway corridor. We would expect the corridor around I-25 to lack greenspace due to the built-up infrastructure in the area. It is also important to note that the census tract in the northeast corner of the map contains Denver International Airport and is not heavily populated. Another important feature of this area is the Rocky Mountain Arsenal, also in the northeast corner.

It's still a little difficult to identify how much greenspace exists from satellite images alone, so I'm looking forward to diving into it more.

STEP 3 - Access Asthma and Urban Greenspaces Data¶

Step 3a - Access human health data¶

The U.S. Centers for Disease Control and Prevention (CDC) provides a number of health variables through their Places dataset that might be correlated with urban greenspace. For this assignment, start with adult asthma. Try to limit the data as much as possible for download. Selecting the state and county is one way to do this.

Try It
  1. You can access Places data with an API, but as with many APIs it is easier to test out your search before building a URL. Navigate to the Places Census Tract Data Portal and search for the data you want.
  2. The data portal will make an API call for you, but there is a simpler way to form an API call that is easier to read and modify. Check out the Socrata documentation to see how. You can also find some limited examples and a list of available parameters for this API in the CDC Places SODA Consumer API Documentation.
  3. Once you have formed your query, you may notice that you have exactly 1000 rows. The Places SODA API limits you to 1000 records in a download. Either narrow your search or check out the &$limit= parameter to increase the number of rows downloaded. You can find more information on the Paging page of the SODA API documentation.
  4. You should also clean up this data by renaming the 'data_value' to something descriptive, and possibly selecting a subset of columns.
In [8]:
### Set up a path for the asthma data
cdc_path = os.path.join(data_dir, 'asthma-den1.csv')

### Download asthma data (only once)
if not os.path.exists(cdc_path):
    
    print("Working...")

    cdc_url = (
        
        ## url for data
        "https://data.cdc.gov/resource/cwsq-ngmh.csv"

        ## parameters to filter on, selecting data for Asthma rates in Denver in 2023
        "?year=2023"
        "&stateabbr=CO"
        "&countyname=Denver"
        "&measureid=CASTHMA"
        "&$limit=1500"
    )

    cdc_df = (
        ##open it as a CSV
        pd.read_csv(cdc_url)

        ## Rename columns
        .rename(columns = {
            'data_value': 'asthma',
            'low_confidence_limit': 'asthma_ci_low',
            'high_confidence_limit': 'asthma_ci_high',
            'locationname': 'tract'})

        ## Select columns to keep
        [[
            'year','tract',
            'asthma', 'asthma_ci_low', 'asthma_ci_high',
            'data_value_unit', 'totalpopulation',
            'totalpop18plus'
        ]]
    )

    ## Download to a CSV
    cdc_df.to_csv(cdc_path, index = False)

### Load in asthma data
cdc_df = pd.read_csv(cdc_path)

### Preview asthma data
cdc_df
Out[8]:
year tract asthma asthma_ci_low asthma_ci_high data_value_unit totalpopulation totalpop18plus
0 2023 8031000301 9.9 8.9 11.1 % 5779 4885
1 2023 8031000401 10.4 9.2 11.7 % 3174 2735
2 2023 8031000201 9.8 8.6 10.9 % 3913 3222
3 2023 8031000705 10.5 9.4 11.7 % 3134 2414
4 2023 8031001102 10.0 8.9 11.2 % 5274 4868
... ... ... ... ... ... ... ... ...
171 2023 8031005104 10.4 9.3 11.6 % 3470 2956
172 2023 8031015300 10.2 9.1 11.4 % 3697 3306
173 2023 8031006903 10.4 9.4 11.6 % 2475 2105
174 2023 8031006816 10.7 9.6 12.0 % 5511 4570
175 2023 8031006902 11.0 9.9 12.2 % 3638 3032

176 rows × 8 columns

Step 3b - Join health data with census tract boundaries¶

Try It
  1. Join the census tract GeoDataFrame with the asthma prevalence DataFrame using the .merge() method.
  2. You will need to change the data type of one of the merge columns to match, e.g. using .astype('int64')
  3. There are a few census tracts in Denver that do not have data. You should be able to confirm that they are not listed through the interactive Places Data Portal. However, if you have large chunks of the city missing, it may mean that you need to expand the record limit for your download.
In [9]:
### Change tract identifier datatype for merging
den_tract_gdf.tract2010 = den_tract_gdf.tract2010.astype('int64')

### Merge census data with geometry
tract_cdc_gdf = (
    den_tract_gdf
    .merge(cdc_df,
                 
            ## specify columns to merge on
            left_on = 'tract2010',
            right_on = 'tract',

            ## use inner join to keep only tracts with data in both tables
            how = 'inner'
            )
)

tract_cdc_gdf
Out[9]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry year tract asthma asthma_ci_low asthma_ci_high data_value_unit totalpopulation totalpop18plus
0 0820000 8031000102 08 Denver 0820000-08031000102 3109 POLYGON ((-11691351.798 4834636.885, -11691351... 2023 8031000102 10.2 9.1 11.4 % 3622 3017
1 0820000 8031000201 08 Denver 0820000-08031000201 3874 POLYGON ((-11688301.532 4835632.272, -11688302... 2023 8031000201 9.8 8.6 10.9 % 3913 3222
2 0820000 8031000202 08 Denver 0820000-08031000202 3916 POLYGON ((-11688362.201 4834372.228, -11688360... 2023 8031000202 10.2 9.1 11.4 % 4042 3206
3 0820000 8031000301 08 Denver 0820000-08031000301 5003 POLYGON ((-11691355.36 4833538.467, -11691357.... 2023 8031000301 9.9 8.9 11.1 % 5779 4885
4 0820000 8031000302 08 Denver 0820000-08031000302 4036 POLYGON ((-11692926.635 4832494.047, -11692925... 2023 8031000302 10.0 8.9 11.2 % 4412 3731
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
107 0820000 8031015300 08 Denver 0820000-08031015300 3832 POLYGON ((-11679896.574 4824084.018, -11679896... 2023 8031015300 10.2 9.1 11.4 % 3697 3306
108 0820000 8031015400 08 Denver 0820000-08031015400 3934 POLYGON ((-11691351.798 4835608.508, -11691351... 2023 8031015400 10.6 9.5 11.9 % 4487 3795
109 0820000 8031015500 08 Denver 0820000-08031015500 3313 MULTIPOLYGON (((-11681197.008 4821951.258, -11... 2023 8031015500 10.8 9.6 12.1 % 3399 3019
110 0820000 8031015600 08 Denver 0820000-08031015600 6498 POLYGON ((-11688537.194 4817797.736, -11688985... 2023 8031015600 9.8 8.7 10.8 % 6618 4813
111 0820000 8031015700 08 Denver 0820000-08031015700 5535 POLYGON ((-11691342.668 4817773.296, -11691342... 2023 8031015700 10.1 9.0 11.2 % 5907 4625

112 rows × 15 columns

In [10]:
### Plot asthma data as choropleth
(
    ## load basemap with satellite imagery
    gv.tile_sources.EsriImagery
    
    *

    ## map census tracts with asthma data
    gv.Polygons(

        ## reproject to align CRSs
        tract_cdc_gdf.to_crs(ccrs.Mercator()),

        ## select cols
        vdims = ['asthma', 'tract2010'],

        ## ensure CRSs align
        crs = ccrs.Mercator()
    ).opts(color = 'asthma', colorbar = True, colorbar_opts = {'title':'Percent Asthma Rate'}, title = 'Adult Asthma Rates across Census Tracts', tools = ['hover'])
).opts(width = 600, height = 600, xaxis = None, yaxis = None)
Out[10]:
Reflect and Respond

Write a description and citation for the asthma prevalence data. Do you notice anything about the spatial distribution of asthma in Denver? From your research on the city, what might be some potential causes of any patterns you see?

Looking at the prevalence of asthma on the CDC website, we can see that some areas of Denver are above the national average: the national asthma prevalence in 2022 was 8.2%. Much of Denver falls near this range, but there are some areas (in the north and east) where the rate is over 10% and closer to 12-14%. This could be due to significant industrial manufacturing and companies in the area, as well as urbanization and a lack of greenspace. It is also interesting to consider whether proximity not just to highway infrastructure but to airplane traffic or industrial pollution (Commerce City) also plays a role. There is another tract with a very high asthma rate in central Denver; it would be interesting to see what other demographic variables in that region play a role.

Step 3c - Get data URLs for urban greenspace imagery¶

NAIP data are freely available through the Microsoft Planetary Computer SpatioTemporal Access Catalog (STAC).

Try It

Get started with STAC by accessing the planetary computer catalog with the following code:

e84_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
In [11]:
### Connect to the planetary computer catalog
e84_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
    
)
Try It
  1. Using a loop, for each Census Tract:

    1. Use the following sample code to search for data, replacing the placeholder names with applicable values and descriptive names:

      search = e84_catalog.search(
          collections=["naip"],
          intersects=shapely.to_geojson(tract_geometry),
          datetime=f"{year}"
      )
    2. For each item returned by search.items(), access the URL using item.assets['image'].href

  2. Accumulate the urls in a pd.DataFrame or dict for later

  3. Occasionally you may find that the STAC service is momentarily unavailable. You should include code that will retry the request up to 5 times when you get the pystac_client.exceptions.APIError.

Warning

As always – DO NOT try to write this loop all at once! Stick with one step at a time, making sure to test your work. You also probably want to add a break into your loop to stop the loop after a single iteration. This will help prevent long waits during debugging.

In [16]:
### Convert geometry to lat/lon for STAC
tract_latlong_gdf = tract_cdc_gdf.to_crs(4326)

### Define a path to save NDVI stats
ndvi_stats_path = os.path.join(data_dir, 'den-ndvi-stats3.csv')


### Check for existing data - do not access duplicate tracts

## Initialize an empty list for the census tract IDs
downloaded_tracts = []

## write conditional 
if os.path.exists(ndvi_stats_path):

    ## if it exists, open the csvs in the path
    ndvi_stats_df = pd.read_csv(ndvi_stats_path)

    ##fill the list with the tract values
    downloaded_tracts = ndvi_stats_df.tract.values

## if it doesn't exist, print statement
else: 
    print('No census tracts downloaded so far.')

### Loop through each census tract

## Make list to accumulate data
scene_dfs = []

## Loop through each tract value in the gdf
for i, tract_values in tqdm(tract_latlong_gdf.iterrows()): #progress bar

    ## for each i, extract the tract id
    tract = tract_values.tract2010

    ### Check if statistics are already downloaded for this tract
    if not (tract in downloaded_tracts):

        ### Repeat up to 5 times in case of a momentary disruption
        retry = 0 ## initialize retry counter (do not reuse the loop variable i)
        retry_limit = 5 ## max number of tries
        while retry < retry_limit:

            ### Try accessing the STAC
            try:

                ### Search for tiles
                naip_search = e84_catalog.search(
                    collections = ['naip'],
                    intersects = shapely.to_geojson(tract_values.geometry),
                    datetime = "2021"
                )

                ### Build dataframe with tracts and tile urls
                scene_dfs.append(pd.DataFrame(dict(

                    ## column called tract
                    tract = tract,

                    ## each scene is one naip image
                    date = [pd.to_datetime(scene.datetime).date()
                            for scene in naip_search.items()],

                    rgbir_href = [scene.assets['image'].href
                                  for scene in naip_search.items()],
                )))

                ## Exit retry loop once STAC request succeeds
                break

            ### Try again in case of an APIError
            except pystac_client.exceptions.APIError:
                print(
                    f'Could not connect with STAC server. '
                    f'Retrying tract {tract}...')

                ## Wait 2 seconds before retrying
                time.sleep(2)
                retry += 1



### Concatenate the url dataframes
if scene_dfs:

    ## concatenate
    scene_df = pd.concat(scene_dfs).reset_index(drop = True)

## Otherwise, tell me it didn't work
else:
    scene_df = None

### Preview the url dataframe
scene_df
No census tracts downloaded so far.
0it [00:00, ?it/s]
Out[16]:
tract date rgbir_href
0 8031000102 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
1 8031000201 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
2 8031000201 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
3 8031000202 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
4 8031000202 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
... ... ... ...
180 8031015500 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
181 8031015500 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
182 8031015600 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
183 8031015600 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
184 8031015700 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...

185 rows × 3 columns

In [17]:
print(scene_df)
          tract        date                                         rgbir_href
0    8031000102  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
1    8031000201  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
2    8031000201  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
3    8031000202  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
4    8031000202  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
..          ...         ...                                                ...
180  8031015500  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
181  8031015500  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
182  8031015600  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
183  8031015600  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
184  8031015700  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...

[185 rows x 3 columns]
In [18]:
## Write out scene_df as a csv
scene_df.to_csv(ndvi_stats_path, index = False)

Step 3d- Compute NDVI Statistics¶

Next, calculate some metrics to get at different aspects of the distribution of greenspace in each census tract. These types of statistics are called fragmentation statistics, and can be implemented with the scipy package. Some examples of fragmentation statistics are:

Percentage vegetation: The percentage of pixels that exceed a vegetation threshold (0.12 is common with Landsat)

Mean patch size: The average size of patches, or contiguous areas exceeding the vegetation threshold. Patches can be identified with the label function from scipy.ndimage

Edge density: The proportion of edge pixels among vegetated pixels. Edges can be identified by convolving the image with a kernel designed to detect pixels that are different from their surroundings.
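The first two statistics can be sketched on a toy array. This is a minimal, made-up example (the NDVI values are invented) showing how `scipy.ndimage.label` identifies contiguous patches and how patch sizes can be counted:

```python
import numpy as np
from scipy.ndimage import label

# Toy "NDVI" array containing two distinct vegetated patches
ndvi = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.3],
    [0.0, 0.0, 0.3, 0.0],
])

veg_mask = ndvi > 0.12  # binary vegetation mask from the threshold

# Percentage vegetation: vegetated pixels / total pixels
frac_veg = veg_mask.mean()

# Mean patch size: label contiguous patches, then average their pixel counts
labels, n_patches = label(veg_mask)
patch_sizes = np.bincount(labels.ravel())[1:]  # drop background (label 0)
mean_patch = patch_sizes.mean()

print(frac_veg, n_patches, mean_patch)  # 0.375 2 3.0
```

By default `label` uses 4-connectivity, so diagonal neighbors are not joined into one patch; pass a custom `structure` to change that.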

What is convolution?

If you are familiar with differential equations, convolving with this kernel is a discrete approximation of the Laplacian operator.

For the purposes of calculating edge density, convolution means that we are taking every possible 3x3 chunk of our image, multiplying it element-wise by the kernel, and summing the result:

$$ \text{Kernel} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix} $$

The result is a matrix the same size as the original, minus the outermost edge. If the center pixel is the same as the surroundings, its value in the final matrix will be $-8 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 0$. If it is higher than the surroundings, the result will be negative, and if it is lower than the surroundings, the result will be positive. As such, the edge pixels of our patches will be negative.
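The edge-counting idea above can be demonstrated on a toy binary mask. This is a minimal sketch (the 5x5 mask with a single 3x3 patch is invented for illustration): interior patch pixels convolve to 0, while the patch's perimeter pixels come out negative.

```python
import numpy as np
from scipy.ndimage import convolve

# Laplacian-style kernel from the text
kernel = np.array([
    [1,  1, 1],
    [1, -8, 1],
    [1,  1, 1],
])

# Binary vegetation mask with one 3x3 patch
veg = np.zeros((5, 5), dtype=int)
veg[1:4, 1:4] = 1

# Convolve; treat pixels outside the image as non-vegetated (0)
edges = convolve(veg, kernel, mode="constant", cval=0)

# Edge pixels: vegetated pixels whose convolution result is negative
edge_mask = (edges < 0) & (veg == 1)

# Edge density: proportion of edge pixels among vegetated pixels
edge_density = edge_mask.sum() / veg.sum()
print(edge_density)  # 8 perimeter pixels / 9 patch pixels
```

Here the single interior pixel of the patch gives exactly 0 (all eight neighbors match), and all 8 perimeter pixels are negative, so the edge density is 8/9.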

Try It
  1. Select a single row from the census tract GeoDataFrame using e.g. .loc[[0]], then select all the rows from your URL DataFrame that match the census tract.
  2. For each URL: crop, merge, and clip the imagery, then compute NDVI for that census tract
  3. Set a threshold to get a binary mask of vegetation
  4. Use the sample code to compute the fragmentation statistics. Feel free to add any other statistics you think are relevant, but make sure to include fraction vegetation, mean patch size, and edge density. If you are not sure what any line of code is doing, make a plot or print something to find out! You can also ask ChatGPT or the class.
In [19]:
## Open df of NAIP imagery URLs
scene_df = pd.read_csv(ndvi_stats_path)
scene_df
Out[19]:
tract date rgbir_href
0 8031000102 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
1 8031000201 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
2 8031000201 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
3 8031000202 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
4 8031000202 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
... ... ... ...
180 8031015500 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
181 8031015500 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
182 8031015600 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
183 8031015600 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...
184 8031015700 2021-07-28 https://naipeuwest.blob.core.windows.net/naip/...

185 rows × 3 columns

In [20]:
## Pull out first tract value
tract = den_tract_gdf.loc[0].tract2010

## Convert to string from integer
tract = str(tract)
tract
Out[20]:
'8031000102'
In [21]:
href_value = scene_df.loc[
    scene_df.tract.astype(str) == tract,

    ## column to extract
    "rgbir_href"
].iloc[0] ## Just grab 1

href_value
Out[21]:
'https://naipeuwest.blob.core.windows.net/naip/v002/co/2021/co_060cm_2021/39105/16/m_3910516_se_13_060_20210728.tif'
In [22]:
## Process 1 scene

## Open the raster
tile_da = rxr.open_rasterio(

        ## give it URL
        href_value,
        ## mask out data pixels
        masked = True

## Remove dimensions of length = 1
).squeeze()

## Check it out
tile_da
Out[22]:
<xarray.DataArray (band: 4, y: 12230, x: 9600)> Size: 2GB
[469632000 values with dtype=float32]
Coordinates:
  * band         (band) int64 32B 1 2 3 4
  * y            (y) float64 98kB 4.407e+06 4.407e+06 ... 4.4e+06 4.4e+06
  * x            (x) float64 77kB 4.944e+05 4.944e+05 ... 5.002e+05 5.002e+05
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_IMAGEDESCRIPTION:  OrthoVista
    TIFFTAG_SOFTWARE:          Trimble Germany GmbH
    TIFFTAG_XRESOLUTION:       1
    TIFFTAG_YRESOLUTION:       1
    TIFFTAG_RESOLUTIONUNIT:    1 (unitless)
    AREA_OR_POINT:             Area
    scale_factor:              1.0
    add_offset:                0.0
In [23]:
## Check out dimensions
tile_da.dims
Out[23]:
('band', 'y', 'x')
In [24]:
## Reminder what is in columns
print(scene_df.columns)
Index(['tract', 'date', 'rgbir_href'], dtype='object')
In [25]:
## Check out the data
tract_cdc_gdf
Out[25]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry year tract asthma asthma_ci_low asthma_ci_high data_value_unit totalpopulation totalpop18plus
0 0820000 8031000102 08 Denver 0820000-08031000102 3109 POLYGON ((-11691351.798 4834636.885, -11691351... 2023 8031000102 10.2 9.1 11.4 % 3622 3017
1 0820000 8031000201 08 Denver 0820000-08031000201 3874 POLYGON ((-11688301.532 4835632.272, -11688302... 2023 8031000201 9.8 8.6 10.9 % 3913 3222
2 0820000 8031000202 08 Denver 0820000-08031000202 3916 POLYGON ((-11688362.201 4834372.228, -11688360... 2023 8031000202 10.2 9.1 11.4 % 4042 3206
3 0820000 8031000301 08 Denver 0820000-08031000301 5003 POLYGON ((-11691355.36 4833538.467, -11691357.... 2023 8031000301 9.9 8.9 11.1 % 5779 4885
4 0820000 8031000302 08 Denver 0820000-08031000302 4036 POLYGON ((-11692926.635 4832494.047, -11692925... 2023 8031000302 10.0 8.9 11.2 % 4412 3731
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
107 0820000 8031015300 08 Denver 0820000-08031015300 3832 POLYGON ((-11679896.574 4824084.018, -11679896... 2023 8031015300 10.2 9.1 11.4 % 3697 3306
108 0820000 8031015400 08 Denver 0820000-08031015400 3934 POLYGON ((-11691351.798 4835608.508, -11691351... 2023 8031015400 10.6 9.5 11.9 % 4487 3795
109 0820000 8031015500 08 Denver 0820000-08031015500 3313 MULTIPOLYGON (((-11681197.008 4821951.258, -11... 2023 8031015500 10.8 9.6 12.1 % 3399 3019
110 0820000 8031015600 08 Denver 0820000-08031015600 6498 POLYGON ((-11688537.194 4817797.736, -11688985... 2023 8031015600 9.8 8.7 10.8 % 6618 4813
111 0820000 8031015700 08 Denver 0820000-08031015700 5535 POLYGON ((-11691342.668 4817773.296, -11691342... 2023 8031015700 10.1 9.0 11.2 % 5907 4625

112 rows × 15 columns

In [26]:
tract_cdc_gdf["tract2010"].head()
Out[26]:
0    8031000102
1    8031000201
2    8031000202
3    8031000301
4    8031000302
Name: tract2010, dtype: int64
In [27]:
tract_cdc_gdf["tract2010"].dtype
Out[27]:
dtype('int64')
In [28]:
tract in tract_cdc_gdf["tract2010"].astype(str).values
Out[28]:
True
In [29]:
## Get boundary of the tract we're working with
boundary = (
    tract_cdc_gdf

    ## Convert tract IDs to strings to avoid integer-matching issues
    .assign(tract2010 = lambda df: df["tract2010"].astype(str))

    ## Use tract ID as the index
    .set_index("tract2010")

    ## Grab the tract we're using
    .loc[[str(tract)]]

    ## Make sure the CRS match
    .to_crs(tile_da.rio.crs)

    ## Grab geometry
    .geometry
)

boundary
Out[29]:
tract2010
8031000102    POLYGON ((497842.209 4403809.736, 497842.293 4...
Name: geometry, dtype: geometry
In [30]:
## Crop to bounding box
crop_da = tile_da.rio.clip_box(

    ## Get coordinates of the bounding box
    *boundary.envelope.total_bounds,

    ## Let it expand a little
    auto_expand = True
)

crop_da
Out[30]:
<xarray.DataArray (band: 4, y: 1562, x: 4006)> Size: 100MB
[25029488 values with dtype=float32]
Coordinates:
  * band         (band) int64 32B 1 2 3 4
  * y            (y) float64 12kB 4.404e+06 4.404e+06 ... 4.403e+06 4.403e+06
  * x            (x) float64 32kB 4.954e+05 4.954e+05 ... 4.978e+05 4.978e+05
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_IMAGEDESCRIPTION:  OrthoVista
    TIFFTAG_SOFTWARE:          Trimble Germany GmbH
    TIFFTAG_XRESOLUTION:       1
    TIFFTAG_YRESOLUTION:       1
    TIFFTAG_RESOLUTIONUNIT:    1 (unitless)
    AREA_OR_POINT:             Area
    scale_factor:              1.0
    add_offset:                0.0
In [31]:
## Clip to exact geometry of the census tract
clip_da = crop_da.rio.clip(

    boundary,

    ## Include all pixels touching the boundary
    all_touched = True)

clip_da
Out[31]:
<xarray.DataArray (band: 4, y: 1562, x: 4006)> Size: 100MB
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]],
      shape=(4, 1562, 4006), dtype=float32)
Coordinates:
  * band         (band) int64 32B 1 2 3 4
  * y            (y) float64 12kB 4.404e+06 4.404e+06 ... 4.403e+06 4.403e+06
  * x            (x) float64 32kB 4.954e+05 4.954e+05 ... 4.978e+05 4.978e+05
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_IMAGEDESCRIPTION:  OrthoVista
    TIFFTAG_SOFTWARE:          Trimble Germany GmbH
    TIFFTAG_XRESOLUTION:       1
    TIFFTAG_YRESOLUTION:       1
    TIFFTAG_RESOLUTIONUNIT:    1 (unitless)
    AREA_OR_POINT:             Area
    scale_factor:              1.0
    add_offset:                0.0
In [32]:
clip_da.shape
Out[32]:
(4, 1562, 4006)

NAIP / NDVI Notes

Band 1: Visible Red
Band 2: Visible Green
Band 3: Visible Blue
Band 4: Near-Infrared (NIR)

NDVI = (NIR - Red) / (NIR + Red)

-1 to 0 NDVI: dead plants or inanimate objects
0 to 0.33 NDVI: unhealthy plants
0.33 to 0.66 NDVI: moderately healthy plants
0.66 to 1 NDVI: very healthy plants

In [33]:
## Do NDVI math
ndvi_da = (
    (clip_da.sel(band = 4) - clip_da.sel(band = 1))
    / (clip_da.sel(band = 4) + clip_da.sel(band = 1))
)

ndvi_da
Out[33]:
<xarray.DataArray (y: 1562, x: 4006)> Size: 25MB
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]],
      shape=(1562, 4006), dtype=float32)
Coordinates:
  * y            (y) float64 12kB 4.404e+06 4.404e+06 ... 4.403e+06 4.403e+06
  * x            (x) float64 32kB 4.954e+05 4.954e+05 ... 4.978e+05 4.978e+05
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_IMAGEDESCRIPTION:  OrthoVista
    TIFFTAG_SOFTWARE:          Trimble Germany GmbH
    TIFFTAG_XRESOLUTION:       1
    TIFFTAG_YRESOLUTION:       1
    TIFFTAG_RESOLUTIONUNIT:    1 (unitless)
    AREA_OR_POINT:             Area
    scale_factor:              1.0
    add_offset:                0.0
In [34]:
print(scene_df)
          tract        date                                         rgbir_href
0    8031000102  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
1    8031000201  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
2    8031000201  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
3    8031000202  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
4    8031000202  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
..          ...         ...                                                ...
180  8031015500  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
181  8031015500  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
182  8031015600  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
183  8031015600  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...
184  8031015700  2021-07-28  https://naipeuwest.blob.core.windows.net/naip/...

[185 rows x 3 columns]
In [35]:
### Skip this step if data are already downloaded 
if scene_df is not None: ## conditional

    ### Get an example 
    ## Pull out the identifier for an example tract
    tract = den_tract_gdf.loc[0].tract2010

    ## subset scene_df to only include that tract
    ex_scene_gdf = scene_df[scene_df.tract == tract]

    ### Loop through all images for tract

    ## Make a list to accumulate into
    tile_das = [] ## initiate an empty list

    ## Loop through all the images for this tract
    for _, href_s in ex_scene_gdf.iterrows():

        ### Open vsi connection to data
        tile_da = rxr.open_rasterio(

            ##location of the raster
            href_s.rgbir_href,

            ## deal with no data, remove extra dimensions
            masked = True).squeeze()
        

        ### Crop data to the bounding box of the census tract

        ## Make the bounding box
        boundary = (
            tract_cdc_gdf

            ## Use tract ID as index
            .set_index('tract2010')

            ## Select geometry for the tract
            .loc[[tract]]

            # Reproject
            .to_crs(tile_da.rio.crs)

            # Extract Geometry
            .geometry
        )

        ## Crop to bounding box
        crop_da = tile_da.rio.clip_box(

            ## Get coordinates for the bounding box
            *boundary.envelope.total_bounds,

            ## Let it expand a little
            auto_expand = True
        )

        ### Clip data to the boundary of the census tract
        clip_da = crop_da.rio.clip(
            boundary, all_touched = True)
    


        ### Compute NDVI
        ndvi_da = (
            (clip_da.sel(band = 4) - clip_da.sel(band = 1)) / 
            (clip_da.sel(band = 4) + clip_da.sel(band = 1)) 

        )

        ### Accumulate result
        tile_das.append(ndvi_da)

## Check
print(f"Number of tiles in tile_das: {len(tile_das)}")

### Merge data into a single object
scene_da = rxrmerge.merge_arrays(tile_das)

### Mask vegetation
veg_mask = (scene_da > .3) # Greater than moderately healthy plants


## Count Patch pixels

### Calculate mean patch size
labeled_patches, num_patches = label(veg_mask)
patch_sizes = np.bincount(labeled_patches.ravel())[1:]
mean_patch_size = patch_sizes.mean()

## Check it out
patch_sizes
Number of tiles in tile_das: 1
Out[35]:
array([23, 62,  1, ...,  6,  2,  1], shape=(8043,))
In [36]:
mean_patch_size
Out[36]:
np.float64(208.43590699987567)
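The patch statistics above come from scipy.ndimage.label plus np.bincount. A minimal sketch on a toy binary mask (the values here are invented for illustration) shows how the pieces fit together:

```python
import numpy as np
from scipy.ndimage import label

# Toy vegetation mask: two patches under the default 4-connectivity
veg_mask = np.array([
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
])

# label() assigns each connected patch a unique integer (0 = background)
labeled_patches, num_patches = label(veg_mask)

# bincount counts pixels per label; [1:] drops the background count
patch_sizes = np.bincount(labeled_patches.ravel())[1:]
mean_patch_size = patch_sizes.mean()
```

Here num_patches is 2, patch_sizes holds the pixel count of each patch (3 and 2), and mean_patch_size is 2.5.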
In [37]:
### Make kernel to calculate edge density
kernel = np.array([
    [1, 1, 1], 
    [1, -8, 1], 
    [1, 1, 1]])

### Calculate edge density
edges = convolve(veg_mask, kernel, mode='constant')

edge_density = np.sum(edges != 0) / veg_mask.size

edge_density
Out[37]:
np.float64(0.14472529362166736)
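The kernel above is a Laplacian-style filter: over a uniform 8-neighborhood it sums to zero, so only pixels at a vegetation/non-vegetation boundary produce nonzero values. A small sketch with an invented mask illustrates this:

```python
import numpy as np
from scipy.ndimage import convolve

kernel = np.array([
    [1, 1, 1],
    [1, -8, 1],
    [1, 1, 1]])

# 5x5 toy mask with a solid 3x3 vegetation block in the middle
veg_mask = np.zeros((5, 5), dtype=int)
veg_mask[1:4, 1:4] = 1

# Interior pixels (all 8 neighbors identical) convolve to zero;
# pixels on either side of the boundary do not
edges = convolve(veg_mask, kernel, mode='constant')
edge_density = np.sum(edges != 0) / veg_mask.size
```

The single interior pixel at (2, 2) convolves to zero, while every pixel touching the boundary is flagged as edge.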

Repeat for all tracts¶

Try It
  1. Using a loop, for each Census Tract:

    1. Using a loop, for each data URL:

      1. Use rioxarray to open up a connection to the STAC asset, just like you would a file on your computer
      2. Crop and then clip your data to the census tract boundary > HINT: check out the .clip_box parameter auto_expand and the clip parameter all_touched to make sure you don’t end up with an empty array
      3. Compute NDVI for the tract
    2. Merge the NDVI rasters

    3. Compute:

      1. total number of pixels within the tract
      2. fraction of pixels with an NDVI greater than .12 within the tract (and any other statistics you would like to look at)
    4. Accumulate the statistics in a file for later

  2. Using a conditional statement, ensure that you do not run this computation if you have already saved values. You do not want to run this step many times, or have to restart from scratch! There are many approaches to this, but we actually recommend implementing your caching in the previous cell when you generate your dataframe of URLs, since that step can take a few minutes as well. However, the important thing to cache is the computation.
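One way to implement the caching described above is a small load-or-compute helper that reads the statistics file when it exists and only runs the expensive pipeline otherwise. This is a sketch, not the notebook's exact approach; the compute callable stands in for the NDVI loop:

```python
import os

import pandas as pd

def load_or_compute(path, compute):
    """Load cached statistics from path, or compute and cache them."""
    if os.path.exists(path):
        # Cached results exist -- skip the expensive computation
        return pd.read_csv(path)
    # No cache yet: run the pipeline once and save the result
    df = compute()
    df.to_csv(path, index=False)
    return df
```

Calling load_or_compute a second time with the same path returns the cached CSV without invoking compute again.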

In [38]:
## Make CSV for NDVI data
ndvi_stats_path_veg = os.path.join(data_dir, 'den-ndvi-stats-veg.csv')
In [39]:
### Skip this step if data are already downloaded 
if scene_df is not None:

    ### Loop through the census tracts with URLs
    for tract, tract_date_gdf in tqdm(scene_df.groupby('tract')): 

        ### Open all images for tract
        ## Create a list to store NDVI arrays, with 1 array per NAIP image
        tile_das = []

        ## Iterate over rows in the tract-specific data frame
        for _, href_s in tract_date_gdf.iterrows():
            ### Open vsi connection to data
            tile_da = rxr.open_rasterio(

                ## mask and squeeze
                href_s.rgbir_href, masked = True).squeeze()    
            
            ### Clip data to the tract geometry

            ## Get tract boundary
            boundary = (
                tract_cdc_gdf 
                .set_index('tract2010') # set the index to the 2010 tract ID
                .loc[[tract]]    # grab the row for that tract
                .to_crs(tile_da.rio.crs) # reproject to align coordinates
                .geometry # grab geometry only
            )  

            ## Crop the raster to the bounding box
            crop_da = tile_da.rio.clip_box(
                *boundary.envelope.total_bounds,
                auto_expand = True # ensure that we're getting everything on the boundary
            )

            ## Clip to actual tract geometry
            clip_da = crop_da.rio.clip(boundary, all_touched = True)

            ### Compute NDVI
            ndvi_da = (
            (clip_da.sel(band = 4) - clip_da.sel(band = 1)) / 
            (clip_da.sel(band = 4) + clip_da.sel(band = 1)) 

            )

            ### Accumulate result
            tile_das.append(ndvi_da)

        ### Merge data
        scene_da = rxrmerge.merge_arrays(tile_das)

        ### Mask vegetation
        veg_mask = (scene_da > .3)

        ### Calculate statistics and save data to file
        ## Count all valid pixels in the tract
        total_pixels = scene_da.notnull().sum()

        ## Count all vegetated pixels
        veg_pixels = veg_mask.sum()

        ### Identify vegetation patches
        labeled_patches, num_patches = label(veg_mask) # label each patch to keep track of them

        ## Count patch pixels
        patch_sizes = np.bincount(labeled_patches.ravel())[1:]

        ## Calculate mean patch size
        mean_patch_size = patch_sizes.mean() # Take mean of all patch sizes in that tract

        ### Calculate edge density

        ## Define a kernel
        kernel = np.array([
            [1, 1, 1],
            [1, -8, 1],
            [1, 1, 1]])
        
        ## Detect boundaries between veg and non-veg
        edges = convolve(veg_mask, kernel, mode = 'constant')

        ## Calculate proportion of edge pixels relative to tract area
        edge_density = np.sum(edges != 0) / veg_mask.size
        
       
        ### Add a row to the statistics file for this tract


        ## Create new dataframe for the tract and append it to the csv
        pd.DataFrame(dict(  # Create a dictionary, then turn it into a dataframe using pandas
            
            ## Store tract ID
            tract = [tract],

            ## Store total pixels
            total_pixels = [int(total_pixels)],

            ## Store fraction of veg pixels
            frac_veg = [float(veg_pixels / total_pixels)],

            ## Store mean patch size
            mean_patch_size = [mean_patch_size],

            ## Store edge density
            edge_density = [edge_density]

        ## Write out as a CSV and save to location we've identified
        )).to_csv(
            ndvi_stats_path_veg,

            ## Append mode, if CSV already exists, add to it, don't overwrite it
            mode = 'a',

            ## Get rid of row numbers
            index = False,

            ## Write column names if the CSV doesn't exist yet
            header = (not os.path.exists(ndvi_stats_path_veg))
        )
### Re-load results from file
  0%|          | 0/112 [00:00<?, ?it/s]
In [40]:
## Check the file
ndvi_stats_df = pd.read_csv(ndvi_stats_path_veg)
ndvi_stats_df
Out[40]:
tract total_pixels frac_veg mean_patch_size edge_density
0 8031000102 5321205 0.315051 208.435907 0.144725
1 8031000201 5804883 0.189782 108.165047 0.099341
2 8031000202 4920357 0.232858 117.814190 0.158857
3 8031000301 5354592 0.298791 176.550651 0.198788
4 8031000302 4011626 0.297961 158.760393 0.216279
... ... ... ... ... ...
331 8031015300 3194630 0.288895 178.755375 0.114123
332 8031015400 8819672 0.308673 216.132026 0.137208
333 8031015500 2592303 0.130041 144.742808 0.091326
334 8031015600 10378313 0.144731 79.930928 0.071562
335 8031015700 5831357 0.274495 169.491847 0.143654

336 rows × 5 columns

STEP 4 - Explore your data with plots¶

Choropleth plots¶

Before running any statistical models on your data, you should check that your download worked. You should see differences in the asthma rates and mean NDVI across the City.

Try It

Create a plot that contains:

  • 2 side-by-side choropleth plots
  • Asthma prevalence on one plot and mean NDVI on the other
  • Make sure to include a title and labeled color bars
In [41]:
### Merge census data with geometry
den_ndvi_cdc_gdf = (

    ## Grab census tract geometry
    tract_cdc_gdf

    ## Merge with NDVI dataframe
    .merge(
        ndvi_stats_df, 

        ## match on the tract ID
        ## left: tract2010
        ## right: tract
        left_on = 'tract2010',
        right_on = 'tract',
        how = 'inner' ## keep only tracts present in both data sets
    )
)

## check it out
den_ndvi_cdc_gdf
Out[41]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry year tract_x asthma asthma_ci_low asthma_ci_high data_value_unit totalpopulation totalpop18plus tract_y total_pixels frac_veg mean_patch_size edge_density
0 0820000 8031000102 08 Denver 0820000-08031000102 3109 POLYGON ((-11691351.798 4834636.885, -11691351... 2023 8031000102 10.2 9.1 11.4 % 3622 3017 8031000102 5321205 0.315051 208.435907 0.144725
1 0820000 8031000102 08 Denver 0820000-08031000102 3109 POLYGON ((-11691351.798 4834636.885, -11691351... 2023 8031000102 10.2 9.1 11.4 % 3622 3017 8031000102 5321205 0.315051 208.435907 0.144725
2 0820000 8031000102 08 Denver 0820000-08031000102 3109 POLYGON ((-11691351.798 4834636.885, -11691351... 2023 8031000102 10.2 9.1 11.4 % 3622 3017 8031000102 5321205 0.315051 208.435907 0.144725
3 0820000 8031000201 08 Denver 0820000-08031000201 3874 POLYGON ((-11688301.532 4835632.272, -11688302... 2023 8031000201 9.8 8.6 10.9 % 3913 3222 8031000201 5804883 0.189782 108.165047 0.099341
4 0820000 8031000201 08 Denver 0820000-08031000201 3874 POLYGON ((-11688301.532 4835632.272, -11688302... 2023 8031000201 9.8 8.6 10.9 % 3913 3222 8031000201 5804883 0.189782 108.165047 0.099341
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
331 0820000 8031015600 08 Denver 0820000-08031015600 6498 POLYGON ((-11688537.194 4817797.736, -11688985... 2023 8031015600 9.8 8.7 10.8 % 6618 4813 8031015600 10378313 0.144731 79.930928 0.071562
332 0820000 8031015600 08 Denver 0820000-08031015600 6498 POLYGON ((-11688537.194 4817797.736, -11688985... 2023 8031015600 9.8 8.7 10.8 % 6618 4813 8031015600 10378313 0.144731 79.930928 0.071562
333 0820000 8031015700 08 Denver 0820000-08031015700 5535 POLYGON ((-11691342.668 4817773.296, -11691342... 2023 8031015700 10.1 9.0 11.2 % 5907 4625 8031015700 5831357 0.274495 169.491847 0.143654
334 0820000 8031015700 08 Denver 0820000-08031015700 5535 POLYGON ((-11691342.668 4817773.296, -11691342... 2023 8031015700 10.1 9.0 11.2 % 5907 4625 8031015700 5831357 0.274495 169.491847 0.143654
335 0820000 8031015700 08 Denver 0820000-08031015700 5535 POLYGON ((-11691342.668 4817773.296, -11691342... 2023 8031015700 10.1 9.0 11.2 % 5907 4625 8031015700 5831357 0.274495 169.491847 0.143654

336 rows × 20 columns
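Note that the merged table has 336 rows for 112 tracts: because the statistics CSV is written in append mode, re-running the loop cell adds a fresh copy of every row. One way to guard against that is to drop duplicate tract IDs before merging; a sketch with invented toy values:

```python
import pandas as pd

# Toy stats frame with a repeated tract, as happens after multiple runs
ndvi_stats_df = pd.DataFrame({
    'tract': [8031000102, 8031000102, 8031000201],
    'frac_veg': [0.315, 0.315, 0.190],
})

# Keep one row per tract before merging with the census geometry
ndvi_stats_df = ndvi_stats_df.drop_duplicates(subset='tract')
```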

In [42]:
### Plot choropleths with vegetation statistics

## Make a function - plot_chloropleth
def plot_chloropleth(gdf, **opts):

    ## Docstring to describe what the function is doing
    """Generate a choropleth with the given color column"""

    ## Use geoviews to make a polygon object
    return gv.Polygons(

        ## Use mercator projection for map
        gdf.to_crs(ccrs.Mercator()),

        ## Set plot to display in mercator as well
        crs = ccrs.Mercator()

    ## Drop x and y axis and add a legend
    ).opts(xaxis = None, yaxis = None, colorbar=True, **opts)
In [43]:
## Check that function was created
plot_chloropleth
Out[43]:
<function __main__.plot_chloropleth(gdf, **opts)>
In [55]:
## Apply the function to the Denver data
(
    plot_chloropleth(

        ## Plot asthma by census tract
        den_ndvi_cdc_gdf, color = 'asthma', cmap = 'viridis', title = 'Asthma Prevalence across Denver', colorbar_opts = {'title':'Percent Asthma Rate'})
    +

    ## Add a second plot with vegetation edge density
    plot_chloropleth(
        den_ndvi_cdc_gdf, color = 'edge_density', cmap = 'Greens', title = 'Vegetation Edge Density across Denver', colorbar_opts = {'title':'Edge Density'})
)
Out[55]:

Do you see any similarities in your plots? Do you think there is a relationship between adult asthma and any of your vegetation statistics in Denver? Relate your visualization to the research you have done (the context of your analysis) if applicable.

In the plots above we can see high rates of asthma in northwest Denver and in one specific tract in north-central Denver. Just looking at these maps, vegetation also appears limited in these regions, suggesting a possible association between the two variables (more so than in studies of Chicago, for example). As mentioned previously, the high asthma rates in the northeast could also be associated with heavy industry near Commerce City.

STEP 5: Explore a linear ordinary least-squares regression¶

Model description¶

One way to find if there is a statistically significant relationship between asthma prevalence and greenspace metrics is to run a linear ordinary least squares (OLS) regression and measure how well it is able to predict asthma given your chosen fragmentation statistics.
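As a minimal sketch of this idea (on synthetic data, not the Denver dataset — all values here are illustrative), an OLS fit can be evaluated with R² and RMSE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical data: one predictor with a known linear relationship plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=(100, 1))
y = 2.0 * x[:, 0] + 0.5 + rng.normal(0, 0.1, size=100)

# Fit the OLS regression
reg = LinearRegression().fit(x, y)
y_pred = reg.predict(x)

# R^2 measures the fraction of variance explained; RMSE measures typical error size
r_squared = reg.score(x, y)
rmse = mean_squared_error(y, y_pred) ** 0.5
print(f"R^2 = {r_squared:.3f}, RMSE = {rmse:.3f}")
```

Because the synthetic relationship is strongly linear, R² lands near 1 and RMSE near the noise level; real health data will be far noisier.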

Before fitting an OLS regression, you should check that your data are appropriate for the model.

Reflect and Respond

Write a model description for the linear ordinary least-squares regression that touches on:

  1. Assumptions made about the data
  2. What is the objective of this model? What metrics could you use to evaluate the fit?
  3. Advantages and potential problems with choosing this model.

For our model, we predict asthma prevalence using an OLS regression on vegetation variables (edge density, patch size, and fraction of vegetation). We assume there is some association between vegetation and asthma rates — ideally, that more vegetation corresponds to lower asthma prevalence — and we use the OLS regression to test whether this association holds. The model assumes that asthma rates depend on the vegetation variables alone, which could be problematic because other variables (such as urbanization and industrialization) may also play a role.

Step 5a - Data preparation¶

When fitting statistical models, you should make sure that your data meet the model assumptions through a process of selection and/or transformation.

You can select data by:

  • Eliminating observations (rows) or variables (columns) that are missing data
  • Selecting a model that matches the way in which variables are related to each other (for example, linear models are not good at modeling circles)
  • Selecting variables that explain the largest amount of variability in the dependent variable.

You can transform data by:

  • Transforming a variable so that it follows a normal distribution. The log transform is the most common way to reduce excessive skew (i.e., to make the data more symmetrical), but you should select the transform best suited to your data.
  • Normalizing or standardizing variables to, for example, eliminate negative numbers or effects caused by variables being on different scales.
  • Performing a principal component analysis (PCA) to eliminate multicollinearity among the predictor variables

Tip

Keep in mind that data transforms like a log transform or a PCA must be reversed after modeling for the results to be meaningful.
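For the log transform specifically, reversing the transform is just exponentiation. A quick illustration (with made-up prevalence values):

```python
import numpy as np

# Hypothetical asthma prevalence values (percent)
rates = np.array([8.5, 9.8, 10.2, 13.7])

log_rates = np.log(rates)      # transform before modeling
recovered = np.exp(log_rates)  # reverse the transform after modeling

# np.exp undoes np.log, so predictions made on the log scale
# must be exponentiated to be interpretable as prevalence rates
print(np.allclose(recovered, rates))  # True
```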

Try It
  1. Use the hvplot.scatter_matrix() function to create an exploratory plot of your data.
  2. Make any necessary adjustments to your data to make sure that they meet the assumptions of a linear OLS regression.
  3. Check if there are NaN values, and if so drop those rows and/or columns. You can use the .dropna() method to drop rows with NaN values.
  4. Explain any data transformations or selections you made and why
In [45]:
## Are there any columns with missing values?
den_ndvi_cdc_gdf.isna().any()

## Are there any rows with missing values?
den_ndvi_cdc_gdf.isna().any(axis = 1)

## Are there any missing data in rows OR columns?
den_ndvi_cdc_gdf.isna().any().any()

## How many missing values are in columns?
den_ndvi_cdc_gdf.isna().sum()

## Only the last expression displays: counts of missing values per column
Out[45]:
place2010          0
tract2010          0
ST                 0
PlaceName          0
plctract10         0
PlcTrPop10         0
geometry           0
year               0
tract_x            0
asthma             0
asthma_ci_low      0
asthma_ci_high     0
data_value_unit    0
totalpopulation    0
totalpop18plus     0
tract_y            0
total_pixels       0
frac_veg           0
mean_patch_size    0
edge_density       0
dtype: int64
In [46]:
### Visualize distribution of data


## Select variables
variables = ['frac_veg', 'asthma', 'mean_patch_size', 'edge_density']

## Make a scatterplot
hvplot.scatter_matrix(

    den_ndvi_cdc_gdf
    [variables]
)
Out[46]:
In [47]:
## Histograms
## Make facet
fig, axes = plt.subplots(2, 2, figsize = (12,10))

## list variables to plot
variables = ['frac_veg', 'asthma', 'mean_patch_size', 'edge_density']
titles = ['Vegetated fraction', 'Adult Asthma Rate', 'Mean Patch Size', 'Edge Density']

## Loop through variables and axes to plot histograms
for i, (var, title) in enumerate(zip(variables, titles)):
    ax = axes[i//2, i%2]
    ax.hist(den_ndvi_cdc_gdf[var], bins = 10, color = 'gray', edgecolor = 'black')
    ax.set_title(title)
    ax.set_xlabel(var)
    ax.set_ylabel('Frequency')

## Adjust layers to prevent overlap
plt.tight_layout()
plt.show()
(Figure: 2×2 grid of histograms — vegetated fraction, adult asthma rate, mean patch size, edge density)
In [48]:
## Drop missing observations
model_df = (
    den_ndvi_cdc_gdf

    ## Make a copy of the dataframe
    .copy()

    ## Subset to columns
    [['frac_veg', 'asthma', 'mean_patch_size', 'edge_density', 'geometry']]

    ## Drop rows (observations) with missing values
    .dropna()
)
model_df
Out[48]:
frac_veg asthma mean_patch_size edge_density geometry
0 0.315051 10.2 208.435907 0.144725 POLYGON ((-11691351.798 4834636.885, -11691351...
1 0.315051 10.2 208.435907 0.144725 POLYGON ((-11691351.798 4834636.885, -11691351...
2 0.315051 10.2 208.435907 0.144725 POLYGON ((-11691351.798 4834636.885, -11691351...
3 0.189782 9.8 108.165047 0.099341 POLYGON ((-11688301.532 4835632.272, -11688302...
4 0.189782 9.8 108.165047 0.099341 POLYGON ((-11688301.532 4835632.272, -11688302...
... ... ... ... ... ...
331 0.144731 9.8 79.930928 0.071562 POLYGON ((-11688537.194 4817797.736, -11688985...
332 0.144731 9.8 79.930928 0.071562 POLYGON ((-11688537.194 4817797.736, -11688985...
333 0.274495 10.1 169.491847 0.143654 POLYGON ((-11691342.668 4817773.296, -11691342...
334 0.274495 10.1 169.491847 0.143654 POLYGON ((-11691342.668 4817773.296, -11691342...
335 0.274495 10.1 169.491847 0.143654 POLYGON ((-11691342.668 4817773.296, -11691342...

336 rows × 5 columns

In [49]:
### Perform variable transformation

## Log of asthma rate
model_df['log_asthma'] = np.log(model_df.asthma)

## Log of patch size
model_df['log_patch'] = np.log(model_df.mean_patch_size)
In [50]:
### Visualize transformed variables
hvplot.scatter_matrix(
    model_df
    [[
        'frac_veg',
        'edge_density',
        'log_patch',
        'log_asthma'
    ]]
)
Out[50]:
In [51]:
## Q-Q plots

## Set variables
var_qq = ['frac_veg', 'edge_density', 'log_patch', 'log_asthma']

## Make Q-Q plot for each variable
plt.figure(figsize = (12,10))
for i, var in enumerate(var_qq, 1):

    ## Make 2 x 2 facet
    plt.subplot(2, 2, i)

    ## Norm distribution
    stats.probplot(model_df[var], dist = "norm", plot = plt)

## Show the plot
plt.tight_layout()
plt.show()
(Figure: Q-Q plots of frac_veg, edge_density, log_patch, and log_asthma against the normal distribution)

Reflect and respond: EXPLAIN YOUR SELECTION AND TRANSFORMATION PROCESS HERE

The Q-Q plots reveal that these variables follow a normal distribution fairly well. The lowest and highest ordered values deviate the most from the normal line.


Step 5b - Fit and Predict¶

If you have worked with statistical models before, you may notice that the scikit-learn library takes a slightly different approach than many statistics packages. For example, scikit-learn emphasizes generic model performance measures like cross-validation and feature importance over coefficient p-values and correlation. The scikit-learn approach is meant to generalize more smoothly to machine learning (ML) models, where statistical significance is harder to derive mathematically.
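To illustrate the cross-validation approach scikit-learn encourages, here is a minimal sketch on synthetic data (the array shapes and coefficients are illustrative, not taken from this notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: two predictors with a known linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X @ np.array([1.5, -0.7]) + rng.normal(0, 0.1, size=60)

# 5-fold cross-validation: each fold is held out once as a test set;
# the default scoring for regressors is R^2
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.round(3))
print(round(scores.mean(), 3))
```

The mean of the fold scores is a more honest estimate of out-of-sample performance than R² computed on the training data.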

Try It
  1. Using the scikit-learn documentation or ChatGPT as a starting point, split your data into training and testing datasets.
  2. Fit a linear regression to your training data.
  3. Use your fitted model to predict the testing values.
  4. Plot the predicted values against the measured values. You can use the following plotting code as a starting point.
In [52]:
### Select predictor and outcome variables
X = model_df[['edge_density', 'log_patch']] # Explanatory variables
y = model_df[['log_asthma']]

### Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(

    X, y, test_size = 0.33, random_state = 19 # Make sure data split is reproducible, "seed"
)

### Fit a linear regression
reg = LinearRegression()

## Fit to our data
reg.fit(X_train, y_train)

### Predict asthma values for the test dataset
y_test['pred_asthma'] = np.exp(reg.predict(X_test)) # Back-transform predictions from the log scale

## Real asthma rate for comparison
y_test['asthma'] = np.exp(y_test.log_asthma)

## Check it out
y_test
Out[52]:
log_asthma pred_asthma asthma
92 2.322388 10.479209 10.2
293 2.351375 10.547584 10.5
27 2.617396 10.755577 13.7
143 2.341806 10.280625 10.4
158 2.312535 10.290222 10.1
... ... ... ...
315 2.272126 10.176164 9.7
16 2.341806 10.329474 10.4
237 2.312535 10.337080 10.1
128 2.292535 10.161204 9.9
137 2.379546 10.270777 10.8

111 rows × 3 columns

In [53]:
### Plot measured vs. predicted asthma prevalence with a 1-to-1 line

## Scale y-axis
y_max = y_test.asthma.max()

(
    y_test
    .hvplot.scatter(x='asthma', 
                    y='pred_asthma',
                    xlabel = "Measured adult asthma prevalence",
                    ylabel = "Predicted adult asthma prevalence",
                    title = "Linear Regression Performance - Testing Data"
                    )

    .opts(aspect='equal', 
          xlim=(0, y_max), ylim=(0, y_max), 
          width=600, height=600)
) * hv.Slope(slope=1, y_intercept=0).opts(color='black')

## Each observation is a different census tract
## Not many points follow the 1-to-1 line
## Every model has errors
Out[53]:

Step 5c - Explore spatial bias¶

We always need to think about bias, or systematic error, in model results. Every model is going to have some error, but we’d like to see that error evenly distributed. When the error is systematic, it can be an indication that we are missing something important in the model.

In geographic data, it is common for location to be a factor that doesn’t get incorporated into models. After all – we generally expect places that are right next to each other to be more similar than places that are far away (this phenomenon is known as spatial autocorrelation). However, models like this linear regression don’t take location into account at all.
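Spatial autocorrelation can be quantified with Moran's I. Libraries such as esda provide this statistic, but a minimal hand-rolled sketch on a hypothetical four-tract adjacency (all values below are made up) shows the idea:

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I: near +1 = spatial clustering, near 0 = random, near -1 = dispersion."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()          # deviations from the mean
    n = len(x)
    # Weighted cross-products of neighboring deviations, normalized by total variance
    return (n / w.sum()) * (z @ w @ z) / (z @ z)

# Hypothetical example: 4 tracts in a row; neighbors share an edge (weight 1)
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

clustered = [1, 1, 5, 5]    # similar values next to each other
alternating = [1, 5, 1, 5]  # dissimilar neighbors

print(morans_i(clustered, w))    # positive: spatially autocorrelated
print(morans_i(alternating, w))  # negative: neighbors differ systematically
```

A strongly positive Moran's I on the model residuals would suggest the regression is missing a spatially structured variable.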

Try It
  1. Compute the model error (predicted - measured) for all census tracts
  2. Plot the error as a choropleth map with a diverging color scheme
  3. Looking at both of your error plots, what do you notice? What are some possible explanations for any bias you see in your model?
In [54]:
### Compute model error for all census tracts

## Use the trained model to predict asthma prevalence for each census tract
model_df['pred_asthma'] = np.exp(reg.predict(X))

## Calculate model error
model_df['error_asthma'] = model_df['pred_asthma'] - model_df['asthma']

### Plot error geographically as a choropleth
(
 plot_chloropleth(model_df, color = 'error_asthma', cmap = 'RdBu')

 ## Set frame width and aspect ratio
 .opts(frame_width = 600, aspect = 'equal')   
)

## Add titles and legends etc. 

Out[54]:
Reflect and Respond

What do you notice about your model from looking at the error plots? What additional data, transformations, or model type might help improve your results?

Overall, this model does a reasonable job of revealing an association between asthma and vegetation levels in Denver. Controlling for variables such as industrial pollution, age, income, or other demographics could improve our model immensely. Using a multiple regression model in which these additional independent variables can affect asthma rates would probably provide more accurate results.